# Ultra-long video understanding
## VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B

OpenGVLab · Apache-2.0 · Video-to-Text · Transformers · English

A multimodal video-text model built on InternVideo2-1B and Qwen2.5-7B, using only 16 tokens per frame and supporting input sequences of up to 10,000 frames.
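Checkpoints like this ship with custom modeling code, so loading goes through Transformers' remote-code path. The sketch below is a minimal, assumed usage pattern: the `trust_remote_code` requirement and the commented-out `chat()` helper are guesses at the repository's interface, not a documented API; the model card has the authoritative example.

```python
# Minimal loading sketch for a remote-code video-text checkpoint.
# The entry points are assumptions; the repo's own modeling files define them.
from transformers import AutoModel, AutoTokenizer

model_id = "OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B"

# trust_remote_code is assumed to be required because the video tower and
# chat logic live in the repository's custom code rather than in Transformers.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

# Hypothetical call: many VideoChat-style repos expose a chat() helper that
# samples frames from a video file and answers a question about it.
# response = model.chat(tokenizer, "demo.mp4", "Describe the video.")
```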
## VideoChat-Flash-Qwen2_5-2B_res448

OpenGVLab · Apache-2.0 · Video-to-Text · Transformers · English

VideoChat-Flash-2B is a multimodal model built on UMT-L (300M) and Qwen2.5-1.5B, supporting video-to-text tasks with only 16 tokens per frame and a context window extended to 128k.
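A quick budget makes the 16-tokens-per-frame figure concrete: a 128k context divided by 16 visual tokens per frame leaves room for roughly 8,000 frames before any prompt text is counted. A small illustrative sketch (treating 128k as 128,000 for round numbers and ignoring special tokens):

```python
# Back-of-the-envelope frame budget: how many frames fit in the context
# window at a fixed visual-token cost per frame (illustrative arithmetic only).
def max_frames(context_window: int, tokens_per_frame: int, text_budget: int = 0) -> int:
    """Frames that fit after reserving text_budget tokens for the prompt/answer."""
    return (context_window - text_budget) // tokens_per_frame

# VideoChat-Flash-2B figures from the card: 16 tokens/frame, 128k context.
print(max_frames(128_000, 16))          # 8000 frames with no text reserved
print(max_frames(128_000, 16, 2_048))   # 7872 frames with a 2k-token prompt budget
```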
## llama-vid-7b-full-224-video-fps-1

YanweiLi · Video-to-Text · Transformers

LLaMA-VID is an open-source multimodal chatbot fine-tuned from LLaMA/Vicuna, supporting hours-long video processing by compressing each frame into just two tokens: a context token and a content token.
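The arithmetic behind "hours-long" is similar: at the 1 fps sampling the checkpoint name implies and two tokens per frame, an hour of video costs only about 7,200 tokens. A sketch of that calculation (the 256-token figure is a hypothetical uncompressed per-frame baseline for comparison, not a number from the model card):

```python
# Token cost of long video under per-frame compression
# (2 tokens/frame per LLaMA-VID; 1 fps per the checkpoint name).
def video_tokens(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    return int(duration_s * fps) * tokens_per_frame

one_hour = 3_600  # seconds
print(video_tokens(one_hour, fps=1.0, tokens_per_frame=2))    # 7200 tokens
# Hypothetical uncompressed baseline at 256 visual tokens per frame:
print(video_tokens(one_hour, fps=1.0, tokens_per_frame=256))  # 921600 tokens
```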